Learning from Binary Multiway Data: Probabilistic Tensor Decomposition and its Statistical Optimality

We consider the problem of decomposing a higher-order tensor with binary entries. Such data problems arise frequently in applications such as neuroimaging, recommendation system, topic modeling, and sensor network localization. We propose a multilinear Bernoulli model, develop a rank-constrained likelihood-based estimation method, and obtain the theoretical accuracy guarantees. In contrast to continuous-valued problems, the binary tensor problem exhibits an interesting phase transition phenomenon according to the signal-to-noise ratio. The error bound for the parameter tensor estimation is established, and we show that the obtained rate is minimax optimal under the considered model. Furthermore, we develop an alternating optimization algorithm with convergence guarantees. The efficacy of our approach is demonstrated through both simulations and analyses of multiple data sets on the tasks of tensor completion and clustering.


Motivation and proposal
In recent years, data in the form of multidimensional arrays, a.k.a. tensors, are frequently arising and gaining increasing attention in numerous fields, such as genomics (Hore et al., 2016;Wang et al., 2017c), neuroscience , recommender systems (Sun et al., 2017;Bi et al., 2018), social networks (Nickel et al., 2011), computer vision (Tang et al., 2013), among many others. An important reason for such an increase is the effective representation of multiway data using a tensor structure. One example is the recommender system (Bi et al., 2018), which can be naturally described as a three-way tensor of user × item × context and each entry indicates the user-item interaction. Another example is the DBLP database (Zhe et al., 2016), which is organized into a three-way tensor of author × word × venue and each entry indicates the co-occurrence of the triplets.
Whereas many real-world multiway datasets have continuous-valued entries, there have recently emerged more instances of binary tensors, in which all tensor entries are binary indicators 0/1. Examples include click/no-click action in recommender systems (Sun et al., 2017), multi-relational social networks (Nickel et al., 2011), and brain structural connectivity networks (Wang et al., 2017a). These binary tensors are often noisy and high-dimensional. It is thus crucial to develop effective tools that reduce the dimensionality, take into account the tensor formation, and learn the underlying structures of these massive discrete observations. A number of successful tensor decomposition methods have been proposed (Kolda and Bader, 2009;Anandkumar et al., 2014;Wang and Song, 2017;Zhang and Xia, 2018), revitalizing the classical methods such as CAN-DECOMP/PARAFAC (CP) decomposition (Hitchcock, 1927) and Tucker decomposition (Tucker,

Related work
Our work is closely related to but also clearly distinctive from several lines of existing research, including continuous-valued tensor decomposition, binary matrix decomposition, 1-bit matrix/tensor completion, and Bayesian tensor decomposition.
The first line of relevant research concerns continuous-valued tensor decomposition. In principle, one can apply the existing decomposition methods that were designed for continuous-valued tensor to binary tensor, by pretending the 0/1 entries were continuous. However, such an approach is bound to yield an inferior performance: flipping the entry coding 0 ↔ 1 would totally change the decomposition result, and the predicted values for the unobserved entries could fall outside the valid range [0,1]. While our solution specifically targets binary tensor, we also exploit the connection between continuous-valued tensor decomposition and binary tensor decomposition. We establish this connection by adopting a latent variable perspective, and show that our binary tensor model is equivalent to an entrywise quantization of a noisy continuous-valued tensor. Such a connection allows us to fully characterize the impact of signal-to-noise ratio (SNR) on the tensor recovery accuracy, as we identify three different phases for tensor recovery according to SNR; see Table 1 in Section 4.3. When SNR is bounded by a constant, the loss in binary tensor decomposition is comparable to the case of continuous-valued tensor, suggesting very little information is lost by quantization. On the other hand, when SNR is sufficiently large, stochastic noise turns out to be helpful, and is in fact essential, for estimating the low-rank signal tensor. This phenomenon is called the "dithering" effect of random noise (Davenport et al., 2014), and is clearly contrary to the behavior of continuous-valued tensor decomposition.
The second line is binary matrix decomposition. It has been introduced and studied in the context of binary or logit principal component analysis (PCA) (Collins et al., 2002;De Leeuw, 2006;Lee et al., 2010). The method utilizes the Bernoulli likelihood as the objective function and represents the canonical parameter matrix Θ ∈ R d 1 ×d 2 through its rank-R singular value decomposition: Θ = θ ij = U ΣV T , where θ ij is the logit transformation of the success probability for entry (i, j), U ∈ R d 1 ×R , V ∈ R d 2 ×R and Σ is the R-dimensional diagonal matrix. While tensors are conceptual generalization of matrices, it has been shown that higher-order tensors possess distinct characteristics compared to matrices (Kolda and Bader, 2009). We first note that, for the matrix case, the rank R ≤ min(d 1 , d 2 ), and the columns of U and V T are constrained to be orthogonal for the identification purpose. Such constraints, however, are too restrictive for tensors, since the uniqueness of tensor decomposition holds under much milder conditions (Kruskal, 1977;Bhaskara et al., 2014). Unlike matrices, tensors do not admit classical orthogonal decomposition, and the tensor rank R can exceed the dimension. These different properties make the earlier algorithms that built upon matrix decomposion inapplicable to tensors. On the other hand, if we were to apply the matrix version of binary PCA to a tensor by unfolding the tensor into a matrix, the result is clearly suboptimal. Indeed, in Section 4.1 we show that, for the scenario when the tensor dimensions are the same in all modes, d 1 = . . . = d K = d, the convergence rate for estimating Θ using our tensor method is O(d −K+1 ), while the "best" unfolding solution that unfolds a tensor into a near-square matrix (Mu et al., 2014) gives the convergence rate O(d − K/2 ), where K/2 is the integer part of K/2. This gap between the two rates highlights the importance of binary decomposition that specifically takes advantage of the multi-mode structure in tensors.
The third line is 1-bit matrix completion (Cai and Zhou, 2013;Davenport et al., 2014;Lafond, 2015) and its recent extension to 1-bit tensor completion (Ghadermarzy et al., 2018). The problem is to recover a matrix or tensor from incomplete observations of its entries. The observed entries are highly quantized, sometimes even to a single bit. Thus the problem turns to recover a signal matrix or tensor based on the noise corrupted signs of a subset of its entries. A structural assumption such as approximate low-rankness was often imposed on the signal. We first show in Section 2.2 that our binary tensor model has an equivalent interpretation as the threshold model commonly used in 1-bit quantization. We then compare the convergence rates of the two methods in Section 4.1. When the dimensions are the same in all modes, i.e., d 1 = . . . = d K = d, the convergence rate of 1-bit tensor completion was O(d −(K−1)/4 ) (Ghadermarzy et al., 2018) . By comparison, the rate of our method is O(d −(K−1)/2 ), which is faster. We further show that our estimator is rate optimal for the class of tensors with a CP structure and a bound rank R.
Last but not least, there have been a number of Bayesian binary tensor decomposition algorithms (Nickel et al., 2011;Rai et al., 2014Rai et al., , 2015. Most of those works focused on and were tailored to the specific context of multi-relational learning. Although we take multi-relational learning as one of our applications, we aim to address a general binary tensor decomposition problem, and to reveal some fundamental properties of the problem, such as the SNR phase diagram and minimax rate. As such, our method is different from those context-specific Bayesian tensor algorithms. Besides, ours is a frequentist type solution, and is computationally more tractable than a Bayesian solution.

Notation and organization
We adopt the following notation throughout the article. We use Y = y i 1 ,...,i K ∈ F d 1 ×···×d k to denote an order-K (d 1 , . . . , d K )-dimensional tensor over a filed F. We focus on real or binary tensors, i.e., F = R or F = {0, 1}. The Frobenius norm of Y is defined as Y F = ( i 1 ,...,i K y 2 i 1 ,...,i K ) 1/2 , and the infinity norm of Y is defined as Y ∞ = max i 1 ,...,i K |y i 1 ,...,i K |. We use A ⊗ B to denote the kronecker product of the matrices A and B, and A B to denote the Khatri-Rao product of A and B. We use S d−1 = {x ∈ R d : x 2 = 1} to denote the (d − 1)-dimensional unit sphere, and the shorthand n-set {1, ..., n} to denote n ∈ N + . The rest of the article is organized as follows. We propose the low-rank binary tensor model and discuss its connection with 1-bit observation model in Section 2. We develop a likelihood-based estimation procedure in Section 3. We study the theoretical properties in Section 4, including the upper bound and the minimax lower bound on the tensor recovery accuracy, with a particular focus on the logistic, probit and Laplacian models. We present the simulations in Section 5, and the real-world data analyses in Section 6. We conclude the paper with a discussion in Section 7.

Low-rank binary tensor model
For the binary tensor Y = y i 1 ,...,i K ∈ {0, 1} d 1 ×···×d K , we assume its entries are realizations of independent Bernoulli random variables, in that In this model, f : R → [0, 1] is a strictly increasing function. We further assume that f (θ) is twice-differentiable in θ ∈ R/{0}; f (θ) is strictly increasing and strictly log-concave; and f (θ) is unimodal and symmetric with respect to θ = 0. All these assumptions are fairly mild. In the GLM language, f is often referred to as the "inverse link function". When no confusion arises, we also call f the "link function". The parameter tensor Θ = θ i 1 ,...,i K ∈ R d 1 ×···×d K is continuous-valued, unknown, and is the main object of interest in our tensor estimation inquiry. We also assume that the entries of Y are mutually independent conditional on Θ. This assumption is again commonly adopted in the literature (Collins et al., 2002;De Leeuw, 2006;Lee et al., 2010).
Furthermore, we assume the parameter tensor Θ admits a rank-R CP decomposition, where λ r ∈ R + , with λ 1 ≥ . . . ≥ λ K without loss of generality, a (k) , and Θ cannot be written as a sum of fewer than R outer products. This low-rank structure in (2) dramatically reduces the number of parameters in Θ, from the order of k d k to the order of k d k . More precisely, the effective number of parameters in (2) is p e = R (d 1 + d 2 ) − R 2 for K = 2 (matrices) after adjusting for the nonsingular transformation indeterminacy, and p e = R ( k d k − K + 1) for K ≥ 3 (higher-order tensors) after adjusting for the scaling indeterminacy . This structure is an effective and frequently used tool in high-dimensional tensor analysis, and is shown to provide a reasonable tradeoff between model complexity and model flexibility (Chen et al., 2016). Moreover, this structure seems plausible in numerous scientific applications. For instance, in sensor network locations, DNA haplotype assembly, and brain imaging analysis, it is often found that the rank of Θ is small, is independent of the tensor dimension, and offers a reasonable approximation to the truth Bhaskar and Javanmard, 2015).
Combining (1) and (2) leads to our low-rank binary tensor model. It can be viewed as a generalization of the classical CP decomposition for continuous-valued tensors to binary tensors, in a way that is analogous to the generalization from a linear model to a GLM. We then seek to estimate the rank-R tensor Θ given the observed binary tensor Y.

Latent variable model interpretation
Before turning to estimation, we show that our binary tensor model (1) has an equivalent interpretation as the threshold model commonly used in 1-bit quantization (Davenport et al., 2014;Bhaskar and Javanmard, 2015;Cai and Zhou, 2013;Ghadermarzy et al., 2018). The later viewpoint sheds light on the nature of the binary (1-bit) measurements from the information perspective.
Consider an order-K tensor Θ = θ i 1 ,...,i K ∈ R d 1 ×···×d K with a rank-R CP structure. Suppose that we cannot not directly observe Θ. Instead, we observe the quantized version Y = y i 1 ,...,i K ∈ {0, 1} d 1 ×···×d K , such that, where E = ε i 1 ,...,i K is a noise tensor. In this setting, we view the tensor Θ as more than just parameters of the distribution of Y. Instead, Θ represents an underlying, continuous-valued quantity whose noisy discretization gives Y. This latent model (3) in fact is equivalent to our binary tensor model (1) if the link f behaves like a cumulative distribution function. Specifically, for any choice of f in (1), if we define E as having i.i.d. entries drawn from a distribution whose cumulative distribution function is defined by P(ε ≤ θ) = 1 − f (−θ), then (1) reduces to (3). Conversely, if we set the link function f (θ) = P(ε ≥ −θ), then model (3) reduces to that in (1). We next consider three choices of f , or equivalently, the distribution of E.

Rank-constrained likelihood-based estimation
We propose to estimate the unknown parameter tensor Θ in model (1) using a constrained likelihood approach. The negative log-likelihood function for (1) is where q i 1 ,...,i K = 2y i 1 ,...,i K − 1 takes values -1 or 1, and the second equality is due to the symmetry of the link function f . We next incorporate the CP structure (2) and consider a constrained minimization procedure: for a given rank R ∈ N + and a bound α ∈ R + . Here the feasible set D consists of tensors defined by two constraints. The first is that Θ admits the CP structure (2) with rank R. As discussed in Section 2.1, the low-rank structure (2) is a commonly used dimension reduction tool in tensor data analysis. Moreover, all the subsequent estimation and theoretical results hold as long as the working rank is no smaller than the true rank. The second constraint is that all the entries of Θ are bounded in absolute value by a constant α ∈ R + . We refer to α as the "signal" bound of Θ. This infinity-norm condition is a technical assumption to aid the recovery of Θ in the noiseless case. Similar conditions have been employed for the matrix case (Davenport et al., 2014;Bhaskar and Javanmard, 2015;Cai and Zhou, 2013). We note that, the objective function L Y (Θ) is convex in Θ when the link function f is log-convex in θ i 1 ,...,i K . However, the feasible set D is nonconvex, and thus the optimization (4) is a non-convex problem. We next utilize a formulation of the CP decomposition, and turn the optimization into a convex problem for each block of parameters while fixing the others. Specifically, write the mode-k factor matrices from (2) as where we arbitrarily choose to rescale λ k 's into the last factor matrix, which loses no generality. Then the optimization (4) can be rewritten as We next develop an alternating block relaxation algorithm to solve (5).

Block relaxation optimization algorithm
The key to solve (5) is to note that, although the objective function in (5) is in general not convex in the K factor matrices jointly, it is convex in each factor matrix individually with all other factor matrices fixed. This feature enables a block relaxation type minimization, where we alternatively update one factor matrix at a time while keeping the others fixed. In each iteration, the update of each factor matrix involves solving a number of GLMs separately. To see this, let A (t) k denote the kth factor matrix at the tth iteration, and Let Y(:, j(k), :) denote the sub-tensor of Y at the jth position of the kth mode, j = 1, . . . , d k , k = 1, . . . K, and vec(·) is the operator that turns a tensor into a vector. Then the update A (t+1) k can be obtained row-by-row by solving d k separate GLMs, where each GLM takes vec{Y(:, j(k), :)} ∈ as the "predictors", and the jth row of A k as the "regression coefficient", j = 1, . . . , d k , k = 1, . . . K. In each GLM, the effective number of predictors is R, and the effective sample size is i =k d i . These separable, low-dimensional GLMs allow us to leverage the standard GLM solvers such as those in R/Python/MATLAB, as well as Algorithm 1 Binary tensor decomposition Input: Binary tensor Y ∈ {0, 1} d 1 ×···×d K , link function f , rank R, and entrywise bound α. Output: Rank-R coefficient tensor Θ, along with the factor matrices (A 1 , . . . , A K ).

4:
for k = 1 to K do 5: Obtain A Line search to obtain γ * .

9:
Normalize the columns of A (t+1) k to be of unit-norm for all k ≤ K − 1, and absorb the scales into the columns of A , and normalize the columns of A (t+1) k . We summarize the full optimization procedure in Algorithm 1, and comment on a few implementation details.
First, as the objective function monotonically decreases at each iteration, and it is bounded below in the search domain, the convergence of Algorithm 1 is assured. This is in contrast to the gradient-based approach (Hong et al., 2018) which only explores first-order information on the objective. On the other hand, we recognize that obtaining the global minimizer for such a nonconvex optimization is typically difficult, even for the matrix case (Collins et al., 2002). As such, we usually run the algorithm from multiple initializations, especially for a higher rank and a limited sample size. In our simulations, we have observed that the algorithm often converges to the same minimum from different initial values in only a few iterations; see Section 5.1.
Second, the computational complexity of our algorithm is O(R 3 k d k ) for each iteration. In other words, the per-iteration computational cost scales linearly with the tensor dimension, and this complexity matches the classical continuous-valued tensor decomposition. More precisely, the update of A k involves solving d k separate GLMs. Solving those GLMs requires O(R 3 d k +R 2 k d k ), and therefore the cost for updating K factors in total is O( . We further report the computation time in Section 5.1.
Third, when some tensor entries y i 1 ,...,i K are missing, we replace the objective function is the index set for non-missing entries. That is, we model the observed entries only, and exclude the missing entries in the model fitting. This same strategy to handle missing values has been used for continuous-valued tensor decomposition (Acar et al., 2010). For implementation, we modify line 5 in Algorithm 1, by fitting GLMs to the data for which y i 1 ,...,i K are observed. Other steps in Algorithm 1 are amendable to missing data in a straightforward way. This approach requires that there are no completely missing sub-tensors Y(:, j(k), :), which is a fairly mild condition. This is similar to the coherence condition in the matrix completion problem; for instance, if an entire row or column of a matrix is missing, it is impossible to recover its true decomposition.
Our procedure provides a straightforward way to handle missing data. As a by-product, our tensor decomposition output can also be used for missing value prediction. That is, we predict the missing values Y i 1 ,...,i K using f (Θ i 1 ,...,i K ), whereΘ is the low-rank coefficient tensor estimated from the observed entries. Note that the predicted values are alway between 0 and 1, which can be interpreted as a prediction for P(Y i 1 ,...,i K = 1).
Finally, Algorithm 1 requires the rank of Θ as an input. Estimating an appropriate rank given the data is of practical importance for our approach. We adopt the usual Bayesian information criterion (BIC), and choose the rank that minimizes BIC; i.e., whereΘ(R) is the estimated coefficient tensor Θ under the working rank R, and p e is the effective number of parameters. The empirical performance of this criterion is investigated in Section 5.1. (4), and we call it the maximum likelihood estimator as it minimizes the negative log-likelihood function. In this section, we establish the performance upper bound forΘ MLE . We first introduce two quantities. Define

Performance upper bound
, and γ α = inf |θ|≤α ḟ 2 (θ) whereḟ (θ) = df (θ)/dθ, and α is the bound on the entry-wise infinity-norm of Θ. Note that both f andḟ are non-zero in [−α, α]. Intuitively, the quantity L α controls the "steepness" of the link function f , and γ α controls the "convexity" of f . When α is a fixed constant and f is a fixed function, all these quantities are bounded by some fixed constants independent of the tensor dimension. Moreover, for the logistic, probit and Laplacian models, we have Logistic model: Laplacian model: We assess the estimation accuracy using the deviation in the Frobenius norm. For the true coefficient tensor Θ true ∈ R d 1 ×···×d K and its estimatorΘ, define The next theorem establishes the upper bound forΘ MLE under model (1).
Theorem 1. Suppose Y ∈ {0, 1} d 1 ×···×d K is an order-K binary tensor following model (1), with the link function f , and the true coefficient tensor Θ true ∈ D. Then there exist an absolute constant C 1 > 0, and a constant C 2 > 0 that depends only on K, such that, with probability at least 1 − exp (−C 1 log K k d k ), Note that f is strictly log-concave if and only iff (θ)f (θ) <ḟ (θ) 2 (Boyd and Vandenberghe, 2004). Henceforth, γ α > 0 and L α > 0, which ensures the validity of the bound in (6).
To gain further insight into this error bound, we consider a special setting where the dimensions are the same in all modes, i.e., d 1 = . . . = d K = d. In such a case, our bound (6) reduces to for a fixed rank R and signal bound α as d → ∞. Meanwhile, the error bound for 1-bit tensor completion was (Ghadermarzy et al., 2018) Loss Comparing (7) and (8), we see that our bound has a faster convergence rate. This rate improvement comes from the fact that we impose an exact low-rank structure on Θ, whereas Ghadermarzy et al. (2018) employed the max norm as a surrogate rank measure. Our bound also generalizes the previous results on low-rank binary matrix completion. The convergence rate for rank-constrained matrix completion was O(1/ √ d) (Bhaskar and Javanmard, 2015), which fits into our special case when K = 2. Intuitively, for tensor data, we can view each tensor entry as a data point, and the total number of entries would correspond to the sample size. Compared to the matrix case, a higher-order tensor has a larger number of data points and thus exhibits a faster convergence rate as d → ∞.
We next present the explicit form of the upper bound (6) when the link f is a logistic, probit, or Laplacian function.
Corollary 1. Assume the same setup as in Theorem 1. Then there exist an absolute constant C 1 > 0, and two constants C, C > 0 that depend only on K, such that, with probability at least 1 − exp (−C 1 log K k d k ), (i) for the logistic link, (ii) for the probit link, (iii) for the Laplacian link This corollary follows directly from Theorem 1. We further discuss the dependency of the above error bounds on the signal bound α and the noise level σ in Section 4.3.

Information-theoretical lower bound
We next establish two lower bounds. With a little abuse of notation, we let D(R, α) denote the set of tensors with rank bounded by R and infinity norm bounded by α. The first lower bound is for any statistical estimatorΘ ∈ D(R, α), including but not necessarily the constrained MLEΘ MLE , under the binary tensor model (1). We show that this lower bound nearly matches the upper bound on the estimation accuracy ofΘ MLE , and thus establishes the rate optimality ofΘ MLE . The second lower bound is for any estimatorΘ from the "unquantized" version of the continuousvalued tensor decomposition. This allows us to evaluate how much information is lost by quantizing a continuous-valued tensor to a binary one. The next theorem establishes the information-theoretical lower bound for any estimatorΘ in D(R, α) under model (1).
Theorem 2. Suppose Y ∈ {0, 1} d 1 ×···×d K is an order-K binary tensor following model (1), with a probit link function f , and the true coefficient tensor Θ true ∈ D(R, α). Suppose that R ≤ min k d k and the dimension max k d k ≥ 8. LetΘ denote an estimator of Θ true . Then there exist absolute constants β 0 ∈ (0, 1) and c 0 > 0, such that We only present the result for the probit model, while similar results can be obtained for the logistic and Laplacian model as well. We comment that, in this theorem, we assume that R ≤ d min . This condition is automatically satisfied in the matrix case, since the rank of a matrix is always bounded by its row and column dimension. For the tensor case, this assertion may not always hold. However, in the majority applications, the tensor dimension is large, whereas the rank is relatively small. As such, we view this as a mild condition. By contrast, in Theorem 1, we do not place any constraint on the rank R. We now compare the lower bound (10) to the upper bound (9), as the tensor dimension d k → ∞ while the signal bound α and the noise level σ are fixed. Since d max ≤ k d k ≤ Kd max , both the bounds are of the form C √ d max ( k d k ) −1/2 , where C is a factor that does not depend on the tensor dimension. Henceforth, the estimatorΘ MLE is rate-optimal. This result holds for a general tensor order K, and it immediately suggests the superiority of the tensor solution compared to a matrix-unfolding based solution. For instance, if we unfold an order-3 dimension-(d, d, d) tensor into a matrix and apply a matrix decomposition to estimate Θ, then the optimal convergence rate would be O(d −1/2 ). This is in sharp contrast to the rate O(d −1 ) using a tensor decomposition approach.
In Section 2.2, we show that our binary tensor model can be viewed as an entrywise quantization of a noisy continuous-valued tensor. We next consider an "unquantized" version of the model where the noisy entries,Ỹ = Θ + E, are observed directly without any quantization. We seek an estimatorΘ by "denoising" the continuous-valued observation. The lower bound is obtained via an information-theoretical argument and is applicable to any estimatorΘ ∈ D(R, α).
Theorem 3. Suppose Y ∈ R d 1 ×···×d K is an order-K continuous-valued tensor from the model Y = Θ true + E ∈ R d 1 ×···d K , where Θ true ∈ D(R, α) and E is a tensor of i.i.d. Gaussian entries. Suppose that R ≤ min k d k and max k d k ≥ 8. LetΘ denote an estimator of Θ true . Then there exist absolute constants β 0 ∈ (0, 1) and c 0 > 0 such that This lower bound (11) quantifies the intrinsic hardness of the problem. In the next section, we explicitly evaluate the information loss of our tensor estimation method based on the quantized binary data, by comparing to the case when the latent tensorỸ is fully observed without any quantization.

Phase diagram
The error bounds we have established depend on the signal bound α and the noise level σ. In this section, we define three regimes based on the signal-to-noise ratio, SNR = Θ ∞ /σ, in which the tensor estimation exhibits different behaviors. We borrow the term "phase diagram" from chemistry to describe those different regimes, or phases. Table 1 summarizes the error bounds of the three phrases under the case when d 1 = . . . = d K = d. Our discussion focuses on the probit model, but similar patterns also hold for the logistic and Laplacian models. The first phase is when the noise is weak, in that σ α and SNR O(1). In this regime, we see that the error bound in (9) scales as σ exp(α 2 /σ 2 ), suggesting that increasing the noise level would lead to an improved tensor estimation accuracy. This behavior may seem surprising. But such a phenomenon is not an artifact of the proof, but instead is intrinsic to 1-bit quantization. We also confirm this behavior in simulations in Section 5.1. In fact, when σ goes to zero, the problem essentially reverts to the noiseless case where an accurate estimation of Θ becomes impossible. To see this, we consider a simple example with a rank-1 signal tensor in the latent model (3). Two different coefficient tensors, Θ 1 = a 1 ⊗ a 2 ⊗ a 3 and Θ 2 = sign(a 1 ) ⊗ sign(a 2 ) ⊗ sign(a 3 ), would lead to the same observation Y in the absence of noise, and thus recovery of Θ from Y becomes hopeless. Interestingly, adding a stochastic noise E to the signal tensor prior to 1-bit quantization completely changes the nature of the problem, and an efficient estimate can be obtained through the likelihood approach. In the 1-bit matrix/tensor completion literature, this phenomenon is called the "dithering" effect of random noise (Davenport et al., 2014).
The second phase is when the noise is comparable to the signal, in that O(1) SNR O(d −(K−1)/2 ). In this regime, the error bound in (9) scales linearly with σ. Comparing the lower bound (11) when estimating the signal from the unquantized tensor, to the upper bound (9) when estimating from a quantized one, the two error bounds match with each other. This suggests that 1-bit quantization induces very little loss of information towards the estimation of Θ in this regime. In other words,Θ MLE , which is based on the quantized tensor, can achieve the same degree of accuracy as if it were given access to the completely unquantized measurements.
The third phase is when the noise completely dominates the signal, in that SNR O(d −(K−1)/2 ). In this regime, a consistent estimation of Θ becomes impossible. It is worthy to note that we utilize the maximum norm to represent the signal level. For the goal of estimating Θ itself, this quantity provides an accurate characterization of the fundamental limit of the problem. Our framework is related but distinct from the Tucker low-rank tensors in literature. For example, in a different context of completion of a continuous-valued tensor, Zhang and Xia (2018) employed λ, the smallest singular value of matricization of Θ, to characterize the problem of estimating the singular subspaces of a low-rank, continuous-valued tensor and thus the whole tensor itself. They showed that the impossible regime for a consistent estimation of singular subspaces is λ/σ O(d 1/2 ) for an order-3 tensor.

CP tensor model
We first study the finite-sample performance of our binary tensor decomposition method under the CP tensor model. Specifically, we consider an order-3 dimension-(d, d, d) binary tensor Y generated from the threshold model (3) r , and the entries of a k r are i.i.d. drawn from a standard normal distribution. Without loss of generality, we scale Θ true such that Θ true ∞ = 1. The binary tensor Y is generated based on the entrywise quantization of the latent tensor (Θ true + E), where E consists of i.i.d. Gaussian entries. We vary the rank R ∈ {1, 3, 5}, the tensor dimension d ∈ {20, 30, . . . , 60}, and the noise level σ ∈ {10 −3 , 10 −2.5 , . . . , 10 0.5 }. We employ the logistic link function, and have experimented with different values of α and obtained almost the same results. We report the Frobenius error averaged across n sim = 30 replications. Figure 1a plots the estimation error Loss(Θ true ,Θ MLE ) as a function of the tensor dimension d while holding the noise level fixed at σ = 10 −0.5 for three different ranks R ∈ {1, 3, 5}. It is seen that the estimation error of the constrained MLE decreases as the dimension increases. Consistent with our theoretical results, the decay in the error appears to behave on the order of d −1/2 . A higher-rank tensor tends to yield a larger recovery error, as reflected by the upward shift of the curves as R increases. Indeed, a higher rank means a higher intrinsic dimension of the problem, thus increasing the difficulty of the estimation. Figure 1b plots the estimation error as a function of the noise level σ while holding the dimension fixed at d = 50 for three different ranks R ∈ {1, 3, 5}. A larger estimation error is observed when the noise is either too small or too large. The non-monotonic behavior confirms the changes in the three phases with respect to the SNR. Particularly, in the high SNR regime, the random noise is seen to improve the recovery accuracy.
We further assess the effectiveness of using BIC to select the tensor rank. We consider the tensor dimension d ∈ {20, 40, 60} and the noise level σ ∈ {0.1, 0.01} such that the noise is neither negligible nor overwhelming. Table 2 reports the selected rank averaged over n sim = 30 replications, with the standard error shown in the parenthesis. It is seen that, when d ≥ 40, the rank selection is accurate. When d = 20, the selected rank is slightly smaller than the true rank. This agrees with our expectation, as in tensor decomposition, the total number of entries would correspond to the sample size, and a larger d implies a larger sample size. We also evaluate the computational efficiency of our optimization method. Although the block relaxation in Algorithm 1 has no theoretical guarantee to land to the global optimum, in practice, we often find that the converged points have nearly optimal objective values. Figure 2 shows the trajectories of the objective function under different tensor dimensions and ranks. It is observed that the algorithm generally converges quickly upon random initializations, and it usually takes fewer than 8 steps for the relative decrease in the objective to be below 3% even for a large d and R. The average computation time per iteration is shown in the legend. For instance, when i d i = 60 3 and R = 10, each iteration of Algorithm 1 on average takes fewer than 3 seconds.

Block tensor model
We next evaluate our method under a class of three-way block models (Wang et al., 2017c). Under these models, the signal tensors do not always follow the CP structure (2) exactly. This allows us to evaluate the robustness of our method under potential model misspecification. Specifically, we consider the signal tensor Θ true of dimension d = d 1 = d 2 = d 3 = 50. We partition the tensor entries into five blocks along each of the modes, and let Θ true (i, j, k) = µ m 1 m 2 m 3 when i is in the m 1 th block along mode 1, j in the m 2 th block along mode 2, and k in the m 3 th block along mode 3, where m 1 , m 2 , m 3 = 1, . . . , 5. The block means {µ m 1 m 2 m 3 } are generated according to the following three block models: • Additive-mean model: µ m 1 m 2 m 3 = µ 1 m 1 + µ 2 m 2 + µ 3 m 3 , where µ 1 m 1 , µ 2 m 2 and µ 3 m 3 represent the   • Multiplicative-mean model: µ m 1 m 2 m 3 = µ 1 m 1 µ 2 m 2 µ 3 m 3 , and the rest of setup is the same as the additive-mean model.
For the additive-mean model, the signal tensor is not of an exact CP structure, but its rank is bounded by 3. This is because one can write Θ true = µ 1 ⊗ 1 ⊗ 1 + 1 ⊗ µ 2 ⊗ 1 + 1 ⊗ 1 ⊗ µ 3 , where 1 denotes a length-d vector with every entry equal to 1, µ 1 = (µ 1 m(1) , . . . , µ 1 m(d) ) is a length-d vector whose entries are the marginal means along mode 1 with m(·) : [d] → [5] mapping the mode-1 indices to blocks, and likewise for µ 2 and µ 3 . For the multiplicative-mean model, the signal tensor follows the CP structure with R = 1. For the combinatorial-mean model, again, the signal tensor does not admit an exact CP structure, but has its rank bounded by 5 3 . The final binary tensor Y is generated based on entrywise quantization of (Θ true + E), where Θ true is scaled as Θ true ∞ = 1 and E is a noise tensor with i.i.d. standard Gaussian entries. We fit our binary tensor model with the logistic link function, and set the hyper-parameter α to infinity, which essentially poses no prior on the tensor magnitude. We select the rank R using the BIC criterion. Table 3 reports the performance of our method, in terms of the relative loss, the estimated rank, and the running time, averaged over n sim = 30 data replications, for the three block tensor models. The relative loss is computed as Θ MLE − Θ true F / Θ true F . It is seen that our method is able to recover the signal tensors well in all three scenarios. As an illustration, we also plot one typical realization of the true signal tensor, the input binary tensor, and the recovered signal tensor for each model in Table 3. It is interesting to see that, not only the block structure but also the tensor magnitude are well recovered. Note that there are two potential model misspecifications in this example. First, the additive and combinatorial-mean models do not follow an exact CP  structure. Second, the data is generated from a latent Gaussian noise model, but we always fit with a logistic link. Our method is shown to maintain a sound performance under potential model misspecification.

Real-world data
We next illustrate our binary tensor decomposition method using a number of real-world datasets of binary tensors, with applications ranging from social networks, email communication networks, to brain structural connectivities. We consider two tasks: one is tensor completion, and the other is clustering along one of the tensor modes, both of which are based upon the proposed binary tensor decomposition. The datasets include: • Kinship (Nickel et al., 2011): This is a 104 × 104 × 26 binary tensor consisting of 26 types of relations among a set of 104 individuals in Australian Alyawarra tribe. The data was first collected by Denham and White (2005) to study the kinship system in the Alyawarra language. The tensor entry Y(i, j, k) is 1 if individual i used the kinship term k to refer to individual j, and 0 otherwise.
• Nations (Nickel et al., 2011): This is a 14 × 14 × 56 binary tensor consisting of 56 political relations of 14 countries between 1950 and 1965. The tensor entry indicates the presence or absence of a political action, such as "treaties", "sends tourists to", between the nations. We note that the relationship between a nation and itself is not well defined, so we exclude the diagonal elements Y(i, i, k) from the analysis.  Table 4: Tensor completion for the four real-world binary tensor datasets. Two methods are compared: the proposed binary tensor decomposition, and the classical continuous-valued tensor decomposition. The evaluation criteria are averaged over five random splits of the training and testing data.
• Enron (Zhe et al., 2016): This is a 581 × 124 × 48 binary tensor consisting of the three-way relationship, (sender, receiver, time), from the Enron email dataset. The Enron data is a large collection of emails from Enron employees that covers a period of 3.5 years. Following Zhe et al. (2016), we take a subset of the Enron data and organize it into a binary tensor, with entry Y(i, j, k) indicating the presence of emails from a sender i to a receiver j at a time period k.
• HCP (Wang et al., 2017a): This is a 68 × 68 × 212 binary tensor consisting of structural connectivity patterns among 68 brain regions for 212 individuals from Human Connectome Project (HCP). All the individual images were preprocessed following a standard pipeline , and the brain was parcellated to 68 regions-of-interest following the Desikan atlas (Desikan et al., 2006). The tensor entries encode the presence or absence of fiber connections between those 68 brain regions for each of the 212 individuals.
The first task is binary tensor completion, where we applied binary tensor decomposition to predict the missing entries in the tensor. Specifically, we split the tensor entries into 80% training set and 20 % test, while ensuring that the nonzero entries are split the same way between the training and testing data. The entries in the testing data are masked as missing and then predicted based on the tensor decomposition from the training data. We compare our binary tensor decomposition method using a logistic link function with the classical continuous-valued tensor decomposition method. Since the code of 1-bit tensor completion of Ghadermarzy et al. (2018) is not yet available, we exclude it from our numerical comparison. Table 4 reports the area under the ROC curve (AUC) and the mean squared error (MSE), averaged over five random splits of the training and testing data. It is clearly seen from this table that, the binary tensor decomposition substantially outperforms the classical continunous-valued tensor decomposition. In all datasets, the former obtains a substantially higher AUC and mostly a lower MSE. We also report in Table 4 the percentage of nonzero entries for each data. It is seen that our decomposition method performs well even in the sparse setting. For instance, for the Enron dataset, only 0.01% of the entries are non-zero. The classical decomposition almost blindly assigns 0 to all the hold-out testing entires, resulting in a poor AUC of 79.6%. By comparison, our binary tensor decomposition achieves a much higher classification accuracy, with AUC = 94.3%.
The second task is clustering. We have carried out the clustering analysis on two datasets, Nations and HCP. We did not apply to the other two, Kinship and Enron, because there is no annotation information for the individuals in those two datasets. For the Nations dataset, we first apply the proposed binary tensor decomposition method with the logistic link, then apply the Kmeans clustering along the country mode from the decomposition. The BIC criterion selects the Independence Conference Figure 3: Analysis of the Nations dataset. (a) Binary tensor components. Top 9 tensor components in the country mode from the binary tensor decomposition. The overlaid box depicts the results from the K-means clustering. (b) Relation types with large loadings. Top four relationships identified from the top tensor components in the relation mode. rank R = 9, and the classical elbow method suggests 5 clusters. Figure 3a plots the nine tensor factors along the country mode. It is interesting to observe that the countries have been partitioned into one group containing those from the communist bloc, two groups from the western bloc, two groups from the neutral bloc, and Brazil forming its own group. To gain further insight, we also plot the top four relation types based on their loadings in the tensor factors along the relationship mode in Figure 3b. The partition of the countries is seen to be consistent with their relationship patterns in the adjacency matrices. Indeed, those countries belonging to the same group tend to have similar linking patterns with other countries, as reflected by the block structure in Figure 3b.
We also perform the clustering analysis on the dataset HCP. Again we apply the decomposition method with the logistic link first, and the BIC suggests the rank R = 6. Figure 4 plots the heatmap for the top six tensor components across the 68 brain regions, and Figure 5 shows the edges with high loadings based on the tensor components. Edges are overlaid on the brain template BrainMesh ICBM152 (Xia et al., 2013), and nodes are color coded based on their regions. We see that the brain regions are spatially separated into several groups and that the nodes within each group are more densely connected with each other. We also observe some interesting spatial patterns in the brain connectivity. For instance, the edges captured by tensor component 2 are located within the cerebral hemisphere. The detected edges are association tracts consisting of the long association fibers, which connect different lobes of a hemisphere, and the short association fibers, which connect different gyri within a single lobe. In contrast, the edges captured by tensor component 3 are located across the two hemispheres. Among the nodes with high connection intensity, we identify superior frontal gyrus, which is known to be involved in self-awareness and sensory system (Goldberg et al., 2006). We also identify corpus callosum, which is known as the largest commissural tract in the brain that connects two hemispheres. This is consistent with brain anatomy that suggests the key role of corpus callosum in facilitating interhemispheric connectivity (Roland et al., 2017). Moreover,

Component 6
Figure 4: Analysis of the HCP data: heatmap for binary tensor components across brain regions. For each component, we plot the connection matrix A r = λ r a r ⊗ a r across the 68 brain nodes. the edges shown in tensor component 4 are mostly located within the frontal lobe, whereas the edges in component 5 connect the frontal lobe with parietal lobe. Figure 5: Analysis of the HCP data: edges with high loadings in the tensor decomposition components. We plot the top 10% edges with positive A r (i, j). The width of the edge is proportional to the magnitude of A r (i, j).

Conclusion
Many real-world tensors consist of binary observations. In this article, we have presented a new binary tensor decomposition method. Under suitable link functions, we have shown that the unknown parameter tensor can be accurately and efficiently recovered. When the infinity norm of the unknown tensor is bounded by a constant, our error bound is tight up to a constant and matches with what is best possible for unquantized observations. Next we comment on a number of remaining issues and possible extensions. First, for the optimization, we leverage on a block relaxation algorithm. Although it can not guarantee the global optimality, our numerical experiments have suggested that the converged points often have nearly optimal objective values. In fact, following the proof of Theorem 1, the performance bound holds as long as the likelihood atΘ is large enough, say, greater than or equal to L Y (Θ true ). When starting from random initializations, there could be multiple close-to-optimal choices ofΘ, with negligible difference between their objective values. In that case, any of those choices performs equally well in estimating Θ true . On the other hand, characterizing the global optimality for non-convex problems of this type has attracted a growing interest in the optimization community. Recent work has investigated non-convex optimization involving matrices and showed that there is no local optimum for a certain type of objective (Ge et al., 2016). Non-convex optimization involving tensors are generally harder than matrices (Anandkumar et al., 2014), and its global optimality remains an open problem.
Second, for the theory, we assume the true rank R is known, whereas we propose to estimate the rank using BIC given the data. Actually, as long as the estimated rank is no smaller than the true rank, all our theoretical results still hold valid. On the other hand, it remains an open and challenging question to establish the convergence rate of the estimated rank . We leave a full theoretical investigation of the rank selection consistency and the decomposition error bound under the estimated rank as our future research.
Finally, although we have concentrated on the Bernoulli distribution in this article, we may consider extensions to other exponential-family distributions, so to account for count-valued tensors, multinomial-valued tensors, or tensors with mixed types of entries. Moreover, our proposed method can be thought of as a building block for more specialized tasks such as exploratory data analysis, tensor completion, compressed object representation, and network link prediction. Exploiting the benefits and properties of binary tensor decomposition in each specialized task warrants future research.